Take Home Exercise 2

An article that explore the creation of a age-sex pyramid representing Singapore population data collected on the month of June from year 2000 to 2020 programatically using R

Clement Ong (Cosq (Github Repo))https://github.com/Clementong
2022-02-04

1 Introduction

1.1 Overview

This case study aims to portray the age-sex structural trend that shifts from 2000 to 2020 for the population of Singapore. Different from Take Home Exercise 1, we will learn how to convert the population pyramid static chart created using ggplot2 into an interactive one; using interactive charting packages for R.

More details pertaining to the data and graphs will be explained in later sections.

2 Case Study : Singapore Population trend

The aim of this case study explores the creation of a population pyramid representing Singapore population data collected on the month of June from year 2000 to 2020, using R.

2.1 Overview and Dataset

The Singapore Residents by Planning Area/Subzone,AgeGroup.Sex and Type of Dwelling Singstat Dataset representing the number of people staying in different regions of Singapore by age cohort, sex and dwelling details. This time, we will be making use of two data sets from the same source that would need to be combined later in order to represent the complete population data set for our case study.

These are namely:

1.Singapore Residents by Planning Area / Subzone, Age Group, Sex and Type of Dwelling, June 2000-2010

  1. Singapore Residents by Planning Area / Subzone, Age Group, Sex and Type of Dwelling, June 2011-2020

Notice that these data sets are in separate data frames and there is a need to do some data wrangling to prepare the data for visualization. However, before dwelling into that, we will explore some ideas and inspiration from in-class exercise and previous Take Home Exercise 1 to understand how to create the charts better but first let’s understand the task.

The Task

Build an interactive age-sex pyramid representing the demographic structure of Singapore by age cohort and gender that show the trends of the age-sex structure of Singapore population from 2000 to 2020 at planning area level

Understanding the Task

The Population pyramid is used to present the distribution of different age cohort representing a particular population. It is well known for its pyramid looking shape. Typically it shows a continuous stacked horizontal histogram bar. The population size is on the x-axis while the age-cohort would be presented on the y-axis. The height of each bar typically represents either an absolute frequency or a percentage of the number of people in each age cohort.

Through the population pyramid, we can understand age-sex structure of the Singapore population and identify the population pyramid trend which can unveal things about fertility and motility and whether it is a shrinking population.

Why Interactive ?

Instead of presenting graphs in a fixed frame, we can create interactive charts.

Interactive charts have become a popular way to allow users to explore the visualize and interact with chart elements to get a better sense/context of the graphical message. The good thing about using dash boarding platforms like tableau, is its ability to provide customization interactive capabilities with a few clicks away. This often contain animations and interactive element customization to provide a unique chart exploration experience.

Notice that in R, this capability is not automatically applied to visuals created using the ggplot2 library. Instead, we can utilize interactive plotting elements from other libraries to create these interactive elements.

How does it work?

We need to understand how interactive charts would work in R. There are two known approaches to create interactive charts:

  1. Building a ggplot geometric objects and utilize interactive packages to add on the capabilities interactivity
  2. Building using the interactive package itself like plotly

Before we begin to select the method, we will sketch out the graph that we are going to make.


2.2 Proposed Sketch and Inspiration

These sketches were inspired by interactive capabilities learnt from in class exercise 3 on tableau.

Population Pyramid

A population pyramid shows the distribution of a population by age group and sex. The term pyramid is used to depict a growing population, which most countries hope to achieve. In the context of Singapore, this population pyramid will depict Singapore’s population distribution and the colors, blue and pink will be used the males and females respectively. Since in this task, we are to provide the pyramid for all planning areas, we will create the drop down menu for users to select the planning area they want to view and a play button to allow users to view the population trend over the years from 2000 to 2020.

The Inspiration and Changes

population_sketch

Notice from the sketch that we are going to build on to a proposed interactive chart shown in Take Home Exercise 1. The above sketch have improved on the following :

  1. Notice that we have included a tooltip focused on the population value rather than including extra information.

  2. Notice that we have included a dropdown menu that provides the user with a way to filter for planning areas in R (inspired by Tableau filters)

  3. Notice that a timeline player is included to add extra intractivity for users to be able to view the population structure over time from 2000 to 2020

Overall, the focus here is to understand the use of customisation propertises in which we will be building our graph differently from take of Take Home Exercise 1.

3 Required Libraries

This section provides a summary of the packages required for this exercise.

The following code chunk will check if the required libraries are installed first before loading them into the R environment.

packages = c('ggiraph', 'plotly', 
             'DT', 'patchwork',
             'gganimate', 'tidyverse',
             'readxl', 'gifski', 'kableExtra', 'knitr','crosstalk')
for (p in packages){
  if(!require(p,character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

4 Dataset Challenges


The following are the data challenges faced:

  1. Notice that the data representing the population trend from 2000 to 2020 are in separate files. As such, there will be a need to combine these data. We also need to verify that the union of the files if successful

  2. We need to check for abnormal data fields and missing values and exclude them from the data set. We also need to remember to factorize the age group column as we did in Take Home Exercise 1 using factor()

3.Notice that this time, there is a need to be aware of the multiple columns and groups we need to group_by() later on using dyplr package and other data preparation methods to prepare the data

4.Knowing that we have a way to already create a ggplot population pyramid, we need to now think of the geometric customization of the interactive libraries to produce the same graph but with more intuitive feature for interaction and understanding.

4.1 Data Prepation

The data set is of ‘.csv’ extension which equates to comma separated field format. As such, the read_csv function using the readr library can be used as seen below.

#Reading the Data
pop_2010  <- read_csv('./data/respopagesextod2000to2010.csv')
pop_2020 <- read_csv('./data/respopagesextod2011to2020.csv')

dis1 <- head(pop_2010)
dis2 <- head(pop_2020)

Here are have Singapore Residents by Planning Area / Subzone, Age Group, Sex and Type of Dwelling, June 2000-2010

PA SZ AG Sex TOD Pop Time
Ang Mo Kio Cheng San 0_to_4 Males HDB 1- and 2-Room Flats 20 2000
Ang Mo Kio Cheng San 0_to_4 Males HDB 3-Room Flats 480 2000
Ang Mo Kio Cheng San 0_to_4 Males HDB 4-Room Flats 220 2000
Ang Mo Kio Cheng San 0_to_4 Males HDB 5-Room and Executive Flats 80 2000
Ang Mo Kio Cheng San 0_to_4 Males HUDC Flats (excluding those privatised) 0 2000
Ang Mo Kio Cheng San 0_to_4 Males Landed Properties 0 2000

and here we have Singapore Residents by Planning Area / Subzone, Age Group, Sex and Type of Dwelling, June 2011-2020

PA SZ AG Sex TOD Pop Time
Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males HDB 1- and 2-Room Flats 0 2011
Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males HDB 3-Room Flats 10 2011
Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males HDB 4-Room Flats 30 2011
Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males HDB 5-Room and Executive Flats 50 2011
Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males HUDC Flats (excluding those privatised) 0 2011
Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males Landed Properties 0 2011

Next, we will need to check for missing values and remove them if any. The code below checks for missing values. Notice that both data set do not have missing values.

# check for missing values, rows with more 70% missing
print(pop_2010[rowSums(is.na(pop_2010)) >= 0.7,])
# A tibble: 0 × 7
# … with 7 variables: PA <chr>, SZ <chr>, AG <chr>, Sex <chr>,
#   TOD <chr>, Pop <dbl>, Time <dbl>
print(pop_2020[rowSums(is.na(pop_2020)) >= 0.7,])
# A tibble: 0 × 7
# … with 7 variables: PA <chr>, SZ <chr>, AG <chr>, Sex <chr>,
#   TOD <chr>, Pop <dbl>, Time <dbl>

Now we will need to check the columns to ensure that both tables contain the same number of columns before we union them together. The code chunk below shows the name of the columns of the two data frames. It seems that they contain the same number of columns and so we will combine them.

# checking the columns before a union 
print(names(pop_2010))
[1] "PA"   "SZ"   "AG"   "Sex"  "TOD"  "Pop"  "Time"
print(names(pop_2020))
[1] "PA"   "SZ"   "AG"   "Sex"  "TOD"  "Pop"  "Time"
# all same + the data type is same 

Combining the columns using the union() operation by dplyr will allow us to merge the datasets one under another, merge to the bottom. Following the printed tables, on the top, we have the table where it shows to be from 2000 and the bottom the table where it shows to be 2020.

# union these data set 
combined_pop <- union(pop_2010, pop_2020)
PA SZ AG Sex TOD Pop Time
Ang Mo Kio Cheng San 0_to_4 Males HDB 1- and 2-Room Flats 20 2000
Ang Mo Kio Cheng San 0_to_4 Males HDB 3-Room Flats 480 2000
Ang Mo Kio Cheng San 0_to_4 Males HDB 4-Room Flats 220 2000
Ang Mo Kio Cheng San 0_to_4 Males HDB 5-Room and Executive Flats 80 2000
Ang Mo Kio Cheng San 0_to_4 Males HUDC Flats (excluding those privatised) 0 2000
Ang Mo Kio Cheng San 0_to_4 Males Landed Properties 0 2000
PA SZ AG Sex TOD Pop Time
Yishun Yishun West 90_and_over Females HDB 4-Room Flats 60 2020
Yishun Yishun West 90_and_over Females HDB 5-Room and Executive Flats 20 2020
Yishun Yishun West 90_and_over Females HUDC Flats (excluding those privatised) 0 2020
Yishun Yishun West 90_and_over Females Landed Properties 0 2020
Yishun Yishun West 90_and_over Females Condominiums and Other Apartments 10 2020
Yishun Yishun West 90_and_over Females Others 30 2020

4.2 Data Checking

To ensure that data have successfully merge, lets ensure that the rows combined are the same.

# checking the number of rows 
print(nrow(pop_2010) + nrow(pop_2020)) 
[1] 2025248
print(nrow(combined_pop))
[1] 2025248
# Turned out same

After successfully merging the data sets, lets check if the values of the categorical column and ensure that they are consistent to ensure data quality. The code chunk below would do the former.

# checking combined data for missing values
print(unique(combined_pop$PA)) # Not stated is found, exclude using dyplr mutate
 [1] "Ang Mo Kio"              "Bedok"                  
 [3] "Bishan"                  "Boon Lay/Pioneer"       
 [5] "Bukit Batok"             "Bukit Merah"            
 [7] "Bukit Panjang"           "Bukit Timah"            
 [9] "Central Water Catchment" "Changi"                 
[11] "Changi Bay"              "Choa Chu Kang"          
[13] "Clementi"                "Downtown Core"          
[15] "Geylang"                 "Hougang"                
[17] "Jurong East"             "Jurong West"            
[19] "Kallang"                 "Lim Chu Kang"           
[21] "Mandai"                  "Marina East"            
[23] "Marina South"            "Marine Parade"          
[25] "Museum"                  "Newton"                 
[27] "North-Eastern Islands"   "Novena"                 
[29] "Orchard"                 "Outram"                 
[31] "Pasir Ris"               "Paya Lebar"             
[33] "Punggol"                 "Queenstown"             
[35] "River Valley"            "Rochor"                 
[37] "Seletar"                 "Sembawang"              
[39] "Sengkang"                "Serangoon"              
[41] "Simpang"                 "Singapore River"        
[43] "Southern Islands"        "Straits View"           
[45] "Sungei Kadut"            "Tampines"               
[47] "Tanglin"                 "Tengah"                 
[49] "Toa Payoh"               "Tuas"                   
[51] "Western Islands"         "Western Water Catchment"
[53] "Woodlands"               "Yishun"                 
[55] "Not Stated"              "Boon Lay"               
[57] "Pioneer"                
print(unique(combined_pop$AG)) 
 [1] "0_to_4"      "5_to_9"      "10_to_14"    "15_to_19"   
 [5] "20_to_24"    "25_to_29"    "30_to_34"    "35_to_39"   
 [9] "40_to_44"    "45_to_49"    "50_to_54"    "55_to_59"   
[13] "60_to_64"    "65_to_69"    "70_to_74"    "75_to_79"   
[17] "80_to_84"    "85_to_89"    "90_and_over"
print(unique(combined_pop$Sex))
[1] "Males"   "Females"

From the above, it seems like *Not Stated* value of the PA (Planning Area) Column show to be rows that can affect the data quality of the visualization and analysis. As such, during the data wrangling stage, we will exclude rows with this value

Also notice that there are many planning areas. We will go ahead and perform another data quality check to ensure that all areas have population values.

  1. We will select() all planning locations
  2. group_by() planning locations and summarised() sum their total population values
  3. filter() for all sums that are 0
  4. select() only the planning area and finally ungroup()
  5. print the list of locations with no population values as a vector
  6. Noticed from the Dataframe, there is also a weird location called Museum which will be excluded
Location_pop <- combined_pop %>%
  select(PA, Pop) %>%
  group_by(PA) %>%
  summarise(total=sum(Pop)) %>%
  filter(total==0) %>%
  select(PA) %>%
  ungroup()

kable(Location_pop)
PA
Boon Lay
Boon Lay/Pioneer
Central Water Catchment
Changi Bay
Marina East
Marina South
Paya Lebar
Pioneer
Simpang
Straits View
Tengah
Tuas
Western Islands
no_pop_locations <- as.vector(Location_pop$PA)
no_pop_locations <- c(no_pop_locations, c("Museum"))
print(no_pop_locations)
 [1] "Boon Lay"                "Boon Lay/Pioneer"       
 [3] "Central Water Catchment" "Changi Bay"             
 [5] "Marina East"             "Marina South"           
 [7] "Paya Lebar"              "Pioneer"                
 [9] "Simpang"                 "Straits View"           
[11] "Tengah"                  "Tuas"                   
[13] "Western Islands"         "Museum"                 

These locations will be filtered out in data wrangling. Removing these locations will also help with the interactie chart render time.

4.3 Data Wrangling

Upon successfully checking the quality and preparing the raw data to be wrangled, we will proceed with the data wrangling process. The following code chunk will do the following :

  1. select() : Planning Area, Time, Age group, gender and population columns
  2. group_by() : Planning Area, Time, Age group and gender to sum up the population for the respective categories
  3. summarise() : Summaries these respective groups with the total population
  4. arrange() : Sort the output data frame according to Planning Area, Time, Age group and then gender
  5. filter() : Filter out the *Not Stated* value of the PA (Planning Area) Column as mentioned in the previous section. *%in%* is the same as *in* operator used to check if a column contains a value from a list/vector of string values. ! operator in this case would do the opposite
  6. ungroup() : It is good practice to un-grouping your data after every group_by() function
  7. Renaming : We will go ahead and rename the columns to be more readable
# GET DATA
combined_pop_grouped <- combined_pop %>% 
  select(PA,Time,AG, Sex, Pop) %>%
  group_by(PA,Time,AG,Sex) %>% 
  summarise(Total = sum(Pop)) %>%
  arrange(PA,Time,Sex,AG) %>%
  filter(PA != 'Not Stated') %>%
  filter(!PA %in% no_pop_locations) %>%
  ungroup()

# renaming table columns
names(combined_pop_grouped) <- c("Planning_Area","Year","Age_Group","Gender","Population")

Quick check

print(unique(combined_pop_grouped$Planning_Area))
 [1] "Ang Mo Kio"              "Bedok"                  
 [3] "Bishan"                  "Bukit Batok"            
 [5] "Bukit Merah"             "Bukit Panjang"          
 [7] "Bukit Timah"             "Changi"                 
 [9] "Choa Chu Kang"           "Clementi"               
[11] "Downtown Core"           "Geylang"                
[13] "Hougang"                 "Jurong East"            
[15] "Jurong West"             "Kallang"                
[17] "Lim Chu Kang"            "Mandai"                 
[19] "Marine Parade"           "Newton"                 
[21] "North-Eastern Islands"   "Novena"                 
[23] "Orchard"                 "Outram"                 
[25] "Pasir Ris"               "Punggol"                
[27] "Queenstown"              "River Valley"           
[29] "Rochor"                  "Seletar"                
[31] "Sembawang"               "Sengkang"               
[33] "Serangoon"               "Singapore River"        
[35] "Southern Islands"        "Sungei Kadut"           
[37] "Tampines"                "Tanglin"                
[39] "Toa Payoh"               "Western Water Catchment"
[41] "Woodlands"               "Yishun"                 

Next, the total Population size of males for each group will be multiplied by a factor of -1 to vertically flip the values from the positive x-axis to the negative x-axis.

# All males are negative so they go to the left
combined_pop_grouped$Total_Population <- combined_pop_grouped$Population
combined_pop_grouped$Total_Population <- ifelse(combined_pop_grouped$Gender == "Males"
                                                , -1*combined_pop_grouped$Total_Population
                                                , combined_pop_grouped$Total_Population)
Planning_Area Year Age_Group Gender Population Total_Population
Yishun 2020 65_to_69 Males 6080 -6080
Yishun 2020 70_to_74 Males 3840 -3840
Yishun 2020 75_to_79 Males 1790 -1790
Yishun 2020 80_to_84 Males 1160 -1160
Yishun 2020 85_to_89 Males 510 -510
Yishun 2020 90_and_over Males 180 -180
Planning_Area Year Age_Group Gender Population Total_Population
Ang Mo Kio 2000 0_to_4 Females 4460 4460
Ang Mo Kio 2000 10_to_14 Females 5520 5520
Ang Mo Kio 2000 15_to_19 Females 5930 5930
Ang Mo Kio 2000 20_to_24 Females 7160 7160
Ang Mo Kio 2000 25_to_29 Females 7750 7750
Ang Mo Kio 2000 30_to_34 Females 6810 6810

On the bottom, notice female values for total_population is positive while male is negative (top table) this will be used for the barplot element.

Interactive Charts also often make use of tool tips. As such, we will set up the text that we want to display to users. One thing we need to clean is the *Age Group* column. Notice that there are _ (underscores) that act as spaces between the characters.

kable(combined_pop_grouped$Age_Group[1:5])
x
0_to_4
10_to_14
15_to_19
20_to_24
25_to_29

To resolve this, the following code chunk will show how we utilize the sub() function to substitute the _ and create a tool tip column that shows the Gender,Age Group and Population. It will utilize the \n special character to signal for a new line between these fields in the tool tip. Once we have replace the characters, we will factorize the Age Group column; respective age bins (categories) in an ordered manner, with those of ages 0 being the smallest and over 90 being the largest value in the scale.

# Substituting '_'
combined_pop_grouped$Age_Group<-sub('_to_', ' to ', combined_pop_grouped$Age_Group)
combined_pop_grouped$Age_Group<-sub('_and_', ' and ', combined_pop_grouped$Age_Group)

# Creation of the tooltip 
combined_pop_grouped$tooltips <- c(paste0("Gender = ", combined_pop_grouped$Gender
                                          , "\n Age Group = ", combined_pop_grouped$Age_Group
                                          , "\n Population = ", combined_pop_grouped$Population ))

# Similarly, like take home excessive one, we will factorize the age group
combined_pop_grouped$Age_Group <- factor(combined_pop_grouped$Age_Group, ordered=TRUE 
                                         ,levels=c("0 to 4","5 to 9","10 to 14","15 to 19"
                                                   ,"20 to 24","25 to 29","30 to 34"
                                                   ,"35 to 39","40 to 44","45 to 49"
                                                   ,"50 to 54","55 to 59","60 to 64"
                                                   ,"65 to 69","70 to 74","75 to 79"
                                                   ,"80 to 84","85 to 89","90 and over"))

Configuring for a bigger picture

While we have removed a number of places with 0 population, we also realized that there are too many planning areas. To provide a higher level visualization to understand population structures over time, we will go ahead and aggregate the data to provide for higher level view of the data set.

  1. We will first make an empty Region column to group the different planning areas by their region. We will make use of dyplr case when function through the with function to create the conditions to group the planning areas into regions. The regions of Singapore and hand recorded and found on the Singapore Wiki of Region
combined_pop_grouped$Region <- NA
combined_pop_grouped$Region <- with(combined_pop_grouped, dplyr::case_when(
  Planning_Area %in% c('Central Water Catchment','Lim Chu Kang','Mandai'
                       ,'Sembawang','Simpang',
                       'Sungei Kadult','Woodlands','Yishun') ~'North',
  Planning_Area %in% c('Ang Mo Kio'
                       ,'Hougang','North-Eastern   Islands','Punggol','Seletar'
                       ,'Sengkang','Serangoon') ~'North East',
  Planning_Area %in% c('Bedok','Changi','Changi Bay'
                       , 'Paya Lebar','Pasir Ris','Tampines') ~ 'East',
  Planning_Area %in% c('Boon Lay','Bukit Batok','Bukit Panjang'
                       ,'Choo Chu Kang', 'Clementi','Jurong East','Jurong West','Pioneer','Tengah','Tuas','Western Islands'
                       ,'Western Water Catchment') ~ 'West',
TRUE ~ 'Central'))

Also notice that the Age-Groups can be further divided into respective groups like the ‘young’ , the ‘working age’ and the ‘aged’. We have introduced young and old in the first take home exercise but for now we will go ahead and segment these age groups into the follow categories based on this paper which is the age classification by Sing stat.

  1. We will use the same case when function and then use the %in% operator to group the values specific by the respective vectors and call this new column Age_Group_cat
young <- c("0 to 4","5 to 9","10 to 14","15 to 19","20 to 24")
working_age <- c("25 to 29","30 to 34","35 to 39","40 to 44","45 to 49","50 to 54","55 to 59","60 to 64")
aged <- c("65 to 69","70 to 74","75 to 79","80 to 84","85 to 89", "90 and over")

combined_pop_grouped <- combined_pop_grouped %>%
  mutate(Age_Group_cat = case_when(
    Age_Group %in% young ~ "young",
    Age_Group %in% working_age ~ "working_age",
    Age_Group %in% aged ~ "aged",
    ))
  
head(combined_pop_grouped)
# A tibble: 6 × 9
  Planning_Area  Year Age_Group Gender  Population Total_Population
  <chr>         <dbl> <ord>     <chr>        <dbl>            <dbl>
1 Ang Mo Kio     2000 0 to 4    Females       4460             4460
2 Ang Mo Kio     2000 10 to 14  Females       5520             5520
3 Ang Mo Kio     2000 15 to 19  Females       5930             5930
4 Ang Mo Kio     2000 20 to 24  Females       7160             7160
5 Ang Mo Kio     2000 25 to 29  Females       7750             7750
6 Ang Mo Kio     2000 30 to 34  Females       6810             6810
# … with 3 more variables: tooltips <chr>, Region <chr>,
#   Age_Group_cat <chr>

We will go head and create a data frame where we can view the number of aged and young population. To do so, we need the spread column. Spread() function takes in a key and a value column. First it will spread the keys into respective columns and fill it with the value. This allows us to transform categorical columns into respective columns with values for each unique row. Doing so will allow us to obtain a planning area by year and region with the number of population by the three different age group categories. After which will can then :

  1. group_by() year region and planning year
  2. summarise() and obtain the sum of population for each age group cat for each of the respective groups
  3. mutate() obtain and overall population column from the sum of the tree. Notice that we did not use total_population because that was not casted into the spread()function
combined_pop_grouped_inter <- combined_pop_grouped %>%
  mutate(i = row_number()) %>%
  spread(Age_Group_cat, Population) %>%
  select(-i)

# Replace NA with 0 to help with summation in the future 
combined_pop_grouped_inter[is.na(combined_pop_grouped_inter)] <- 0
combined_pop_grouped_inter_agg <-combined_pop_grouped_inter %>%
  group_by(Year,Region,Planning_Area) %>% 
  summarise(Aged=sum(aged),Working_Age=sum(working_age),Young=sum(young)) %>% 
  mutate(Population= Aged + Working_Age + Young)

head(combined_pop_grouped_inter_agg)
# A tibble: 6 × 7
# Groups:   Year, Region [1]
   Year Region  Planning_Area  Aged Working_Age Young Population
  <dbl> <chr>   <chr>         <dbl>       <dbl> <dbl>      <dbl>
1  2000 Central Bishan         6170       52710 31400      90280
2  2000 Central Bukit Merah   18720       90020 40130     148870
3  2000 Central Bukit Timah    5250       37670 21670      64590
4  2000 Central Choa Chu Kang  5470       77240 54320     137030
5  2000 Central Downtown Core   660        2680  1010       4350
6  2000 Central Geylang       12440       69180 37520     119140

As seen in the Tableau example, obtaining this data set will certainly allow us to obtain a sum young/aged animated chart but it would not show the values do not reflect the ratio it has against the working age group. As such, at times, visualization would require us to engineering certain values to better represent the visuals and help with drilling down our analysis.

We will go ahead and create two ratios that is often used by the (Eurostat)[https://ec.europa.eu/eurostat/statistics-explained/index.php?title=Glossary:Young-age_dependency_ratio] community. These are the Young_Dependency_Ration and the Aged_Dependency_Ratio; which is the ratio of the young/aged group against the working age. This will help us better understand the pressure on the working population. Typically, those not in the labor force against those typically in the labor force.

In the below code chunk:

  1. We will use select() for the respective columns and obtain the sum of the Aged, Working_Age and Young category

  2. group_by() the year and region will obtain the sum of the age categories by year and region

3, Mutate and obtain the dependencies ratio and round it off to 1 decimal place. Followed by an ungroup() function

combined_pop_grouped_inter_agg <- combined_pop_grouped_inter_agg %>%
  select(c("Year","Region","Aged","Working_Age","Young","Population")) %>%
  group_by(Year, Region) %>% 
  summarise(Aged=sum(Aged),Working_Age=sum(Working_Age),Young=sum(Young),Population=sum(Population)) %>%
  mutate(Aged_Dependency_Ratio= Aged/Working_Age) %>%
  mutate(Young_Dependency_Ratio= Young/Working_Age) %>%
  mutate(Aged_Dependency_Ratio = round(Aged_Dependency_Ratio*100, digits=1),
         Young_Dependency_Ratio = round(Young_Dependency_Ratio*100, digits=1)) %>%
  ungroup() 

head(combined_pop_grouped_inter_agg)
# A tibble: 6 × 8
   Year Region    Aged Working_Age  Young Population Aged_Dependency_…
  <dbl> <chr>    <dbl>       <dbl>  <dbl>      <dbl>             <dbl>
1  2000 Central 105240      612240 318730    1036210              17.2
2  2000 East     40290      371760 244500     656550              10.8
3  2000 North    17910      228050 153170     399130               7.9
4  2000 North …  39130      331880 199290     570300              11.8
5  2000 West     31890      352240 224910     609040               9.1
6  2001 Central 108050      609890 313010    1030950              17.7
# … with 1 more variable: Young_Dependency_Ratio <dbl>

5 Building the Visalisation

5.1 Customizing and Building the Population Pyramid Chart

To build the Population Pyramid chart there are a few components and geometric objects we need to be familiar with. The core components come from the plotly and ggplot library. It’s uses in this segment is explained as follows:

First, we will plot chart to show the trend of the young-age dependency ratio from 2000 to 2020. This will allow us to identify the population trends by regions of Singapore and pick out anomalies. We have first selected our x and y values and then color the fill by region and create the hoverinfo (tooltip information) as the region. We will then use the population as the size to show the different population size in Singapore. We will proceed to frame() the charts into a series of charts created by the year in which plotly() will play it as a timeline. Lastly, we will set up the layout to set the titles and text formats.

Notice how html elements were used to style the chart title and subtitle.

  fig <- combined_pop_grouped_inter_agg %>%
  plot_ly(
    x = ~Aged_Dependency_Ratio, 
    y = ~Young_Dependency_Ratio, 
    size = ~Population, 
    color = ~Region, 
    frame = ~Year, 
    text = ~Region, 
    hoverinfo = "text",
    type = 'scatter',
    mode = 'markers'
  )%>% 
  layout(title=
  "East has the fastest growing aged-dependency ratio trend
  <sup>Chart Represents the young-aged dependency ratio for the different regions of Singapore</sup>"
  , xaxis = list(title = list(text ='Age Dependency Ratio')), yaxis = list(title = list(text ='Young Dependency Ratio')))


fig

The above chart shows that the central region have the most number of people staying (most of the percentage of the population reside). However, despite being a larger hub, it still has a growing age dependency ratio. ‘Younger’ regions, like the north east and east particularly have a noticeable fast growth rate of the age-dependency score. It seems like between the year 2014 onwards, the East Region age-dependency score out grown the west, north and north east regions; who were increasing also. This speed is exceptionally different from the other and require more investigation.

To investigate further, lets plot the same chart but with only the east region of Singapore and fill the chart with the planning areas. To do so, we will create a new data frame that calculates the young-age dependency ratio using the same procedure as explained above but this time by regions.

We will then craft the ggplot() object and make use of the facet_wrap() function to create a facet grid plot to show the young-aged dependency ratio of all planning areas by the region. We will also specifc in the aes() a frame=Year; once we output the plot using the ggplotly() function, it will create the interactive chart and read the frame() aesthetics argument as the same frame() argument to create the year timeline.

combined_pop_grouped_inter_agg_PA <- combined_pop_grouped_inter %>%
  select(Year, Region ,Planning_Area, aged, young, working_age, Total_Population) %>%
  group_by(Planning_Area, Region, Year) %>%
   summarise(Aged=sum(aged),Working_Age=sum(working_age),Young=sum(young),Population=sum(Total_Population)) %>%
  mutate(Aged_Dependency_Ratio= Aged/Working_Age) %>%
  mutate(Young_Dependency_Ratio= Young/Working_Age) %>%
  mutate(Aged_Dependency_Ratio = round(Aged_Dependency_Ratio*100, digits=1),
         Young_Dependency_Ratio = round(Young_Dependency_Ratio*100, digits=1)) %>%
  #filter(Region == "East") %>%
  ungroup()


g<-combined_pop_grouped_inter_agg_PA%>%
  ggplot(aes(x = Young_Dependency_Ratio, y = Aged_Dependency_Ratio, label=Planning_Area, color = Region, frame=Year))+
  geom_point()+
  facet_wrap(vars(Region)) +
  labs(
    title = "Planning Areas have an increasing Aged Dependency Ratio Trend"
    , subtitle = "Regions of Singapore Young-Age Dependency Ratio by Planning Area"
    , x="Young Dependency Ratio" , y="Aged Dependency Ratio")
ggplotly(g)

Before interpreting the chart, we can further give the situation context by referring to the mature and non mature estate list to help us interpret the result.

Notice, form the above, that east only has four estates. With all four as a mature estate, this makes sense for the aged dependency to grow faster than other locations where more non-mature estate fall in like the North East; as there is not much newer housing for younger couples (BTOs). For example, North East show interesting patterns like Hougang and Seletar with higher young depdency ratio at first to higher aged dependency score.

To inestigate the East, we will drill down to one location for simplicity, that would be Bedok. Bedok, despite being a mature estate, accelerated in aged dependency score from 2014 onwards. We will investigate further into the age groups living in Bedok and how they the age-sex structure change over time.

To further dive into the problem, we can create an age-sex pyramid / population pyramid of seletar and yet provide interactive charting elements for users to be able to freely explore the data. The solution uses plotly library to build with elements explained earlier. The only difference is generating the update menu that will filter() the data set. It was done using crosstalk package to provide a highlighted data for sharing. The drop down menu will filter for shared data frame that will be passed to the plotly geometric object to plot the population pyramid. Also note, that for the axis range, we will make use of plyr signif function to round of the calculated axis range to the nearest thousand and use an appropriate range value to segregate.

Areas <- unique(combined_pop_grouped$Planning_Area)

get_tick_text <- function(axis_range) {
  round_axis_range <- signif(axis_range, digits = 1)
  text_val <- seq(round_axis_range, 0, -2000)
  t_text <- c(text_val, text_val)
  return (t_text)
}

get_tick_val <- function(axis_range) {
  round_axis_range <- signif(axis_range, digits = 1)
  t_val <- c(seq(round_axis_range, 0, -2000), seq(-round_axis_range, 0, 2000))
  return (t_val)
}

mrg <- list(l = 50, r = 50,
          b = 50, t = 50,
          pad = 20)
generate_graph <- function(axis_range, t_text, t_val) {
  fig <- plot_ly(sdf, 
                 x = ~Total_Population,
                 y = ~Age_Group, 
                 type = 'bar', 
                 hovertext = ~Population,
                 hoverinfo = 'text',
                 color = ~Gender,
                 colors = c("powderblue", "pink")
  ) %>% layout(
    title="Most planning areas in Singapore follow a constrictive trend 
    <sup>Interactive population pyramid showing age-sex strcutural trend from 2000 to 2020</sup>",
    bargap = 0.1, 
    barmode = 'overlay',
    yaxis = list (title = 'Age Group'),
    xaxis = list(range=c(-axis_range, axis_range),
                 tickvals= t_val,
                 ticktext= t_text,
                 title='Total Population'),
    margin  = mrg
  )
  fig
}

generate_graph_output <- function() {
  axis_range <- 14000
  t_text <- get_tick_text(axis_range)
  t_val <- get_tick_val(axis_range)
  generate_graph(axis_range, t_text, t_val)
}

# Wrap data frame in SharedData
sdf <- SharedData$new(combined_pop_grouped)

# Use SharedData like a dataframe with Crosstalk-enabled widgets
bscols(
  widths = c(2,NA),
       list(
         filter_select("Planning_Area", "Area", sdf, ~Planning_Area, multiple = FALSE),
         filter_select("Year", "Year", sdf, ~Year, multiple = FALSE)
   ),
  # Create a filter input    
  
  #data
  generate_graph_output()
)

Selecting Bedok, we can see that from 2014 to 2020, there is a shift in the older age groups upwards and yet the younger age groups seem to drop. This shows that there is a shrinking population trend of Bedok. This is alarming because Bedok while is known to have a larger older generation, it still have many BTO project. It shows that despite these projects, Bedok is still headed towards a shrinking population. In comparison, Hougang, a non-mature estate from 2014 to 2020 have a “slower shrinking population”.

References : Geogrpahical Ditribution of Singapore

We can further appreciate elements of visualization using a geom_point object to outlay the population pyramid as seen below :

bedok_df <- combined_pop_grouped %>%
  filter(Planning_Area == "Bedok") %>%
  dplyr::select(Year,Age_Group,Gender, Population,Total_Population, tooltips) 

ggplot(bedok_df, aes(x = Total_Population, y = Age_Group)) +
  geom_point(alpha = 0.7, 
             show.legend = TRUE,
             aes(color = factor(Gender))) +
  scale_colour_manual(values =  c("#E77878","#4682B4")) +
  scale_size(range = c(2, 12)) +
  labs(title = 'Population Pyramid trend of Bedok of Year: {frame_time}', 
       x = 'Total Population', 
       y = 'Age Group') +
  transition_time(as.integer(Year)) +
  ease_aes('linear') + 
  theme(panel.background = element_rect(fill = 'white', colour = 'white'), panel.grid = element_blank()) +
  labs(colour = "Gender") + 
  theme (legend.title = element_text(size=10, face="bold")
         ,legend.key = element_rect(fill = 'white', colour = 'white') )

The above code chunk users these few new elements which was not introduced before. These elements uses the gg-animate package to create the GIF for the motion chart to represent the age-sex structure for each year. This is done using:

  1. transition_time (https://gganimate.com/reference/transition_time.html): This is a variant of transition_states() that is intended for data where the states are representing specific point in time
  2. ease_aes (default is linear) : The ease_aes() function controls the easing of aesthetics or variables in gganimate

There are other transition parameters and function to use but it would be necessary to customize too much of the timing and frame structure.

Notice how we used the trend line of the dots to help create a visual effect of the trend changing over the years. Contrast of the points colors is also used against a white background to highlight the moving trend. At the same time keeping the titles and text in consistent colors and spacing to avoid distraction.

6 Conclusion and Interpretation

Population pyramids are important graphical representation to understand the composition of population members.It is typically visualized by grouping the population members into age cohorts and further diving the data points into their respective gender groups. In other words, the age-sex structure of specific populations. This makes it easy for demographers to compare the difference between male and female populations and the structure of the population at any given moment. Demographers typically use this to study the trend of populations relating to the fertility and mortality.

There are three trends in population pyramids they are typically: - expansive - constrictive - stationary

#three_trends

We shall focus our efforts in explaining the trend that is reflected in our plot.

6.1 Chart interpretation and conclusion

The Singapore population trend across all regions for both gender is depicted to follow a constrictive population pyramid trend with its ‘beehive’ shape as explain in take home exercise 1.

In this exercise, we found that the trends of the different regions of Singapore follow a similar trend. Furthermore, it seems like the aged dependency ratio is getting larger for most part of Singapore. We seen how older/mature estates are becoming more aged-dependent and also how non-mature estates despite having slight increased in young-dependency scores, still have a trend moving with higher aged-dependency scores. This sort of reflect the idea of how Singapore is increasing the working age limit since we are becoming for aged dependent.

6.2 Tableau vs R

  1. Animation Capabilities : Tableau provides a lot of animation capabilities in a click. This includes most of the animation aesthetics like customizing the tooltip to include sub charts and the creation of the motion chart done during the in class exercise. The animation, interval and pages are easily incorporated in tableau. One of the hardest thing to create was to tweet the animation settings in R as there is a need to reference the ggiraph library

  2. Customization Capabilities : While Tableau is easy to configure to the desired chart, certain customization abilities is only possible through R. For example, the use of functions to automatically adjust the ticks to detailed levels

  3. Smoothest” : In Tableau, running interactive charts seem to be smoother and much more responsive. This capability is fallen short in R. The charts seem to have a noticeable lag and producing highly customized charts have shown to slow down R studio a lot. This makes it had to test highly customized charts

Overall, Tableau provides as quick way to do up an already interactive chart. R, on the other hand, provides a lot of customization capabilities and flexibility. The two software have their own peaks at producing interactive charts, however, it would depend on the level of customization needed. Tableau is quick to market while R is highly flexible.

Comparison between ggplot and interactive libraries

It seems like the adaptation with JavaScript has helped ggplot plots to be adapted to be interactive. It shows the flexibility of creating these charts using the different libraries; each of which the customization of the charts would be different. How we layer and build on to existing geom objects would dependent on the library you use.